Aim of this file

This R markdown file is written to simulate the single-cell data to demonstrate if the order of the batch when applying mutual nearest neighbors (MNN) would affect the clustering results.

Simulation Design

In this simulation study, we are going to simulate four different batches to mimic the temporal from week 0 to week 24. As time goes on, the observed cell population increases. In Figure 1, we present the idea of the simulated data. In each batch, there are different numbers of cell populations: 3, 4, 5, and 7. There exist batch effects among the cell populations 1, 2, 3, 4 and 5 across four batches.

Figure 1

Figure 1

R package to be used: Symsim

To simulate the data, we are going to use Symsim package, which is one most recently published R package for simulating multiple faceted variability in single cell RND sequencing data.

Simulation

Step I: generate true counts data including 7 cell populations

The proportion of zeroes in the true counts data:

## [1] 0.126397

The number of cells in each population:

## 
##   1   2   3   4   5   6   7 
## 200 300 300 300 300 300 300

Step II: Create cells population and 4 different batches (TSNE plots based on batch and cell type):

Bacth 1: proportion of zeroes; proportion of counts > 5; cells plots

## [1] 0.4763277
## [1] 0.1619887

Bacth 2: proportion of zeroes; proportion of counts > 5; cells plots

## [1] 0.4922443
## [1] 0.1609091

Bacth 3: proportion of zeroes; proportion of counts > 5; cells plots

## [1] 0.5029461
## [1] 0.1617658

Bacth 4: proportion of zeroes; proportion of counts > 5; cells plots

## [1] 0.4535441
## [1] 0.1488027

Step III: apply MNN method with different orders to the data. Compare the clustering analysis results without and with MNN when combining batch 1, 2, 3, and 4.

Without correction

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 73676
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9477
## Number of communities: 14
## Elapsed time: 0 seconds

Possible permutations of four batches

######
perm_table = permutations(n = 4, r = 4, v = 1:4)
perm_table
##       [,1] [,2] [,3] [,4]
##  [1,]    1    2    3    4
##  [2,]    1    2    4    3
##  [3,]    1    3    2    4
##  [4,]    1    3    4    2
##  [5,]    1    4    2    3
##  [6,]    1    4    3    2
##  [7,]    2    1    3    4
##  [8,]    2    1    4    3
##  [9,]    2    3    1    4
## [10,]    2    3    4    1
## [11,]    2    4    1    3
## [12,]    2    4    3    1
## [13,]    3    1    2    4
## [14,]    3    1    4    2
## [15,]    3    2    1    4
## [16,]    3    2    4    1
## [17,]    3    4    1    2
## [18,]    3    4    2    1
## [19,]    4    1    2    3
## [20,]    4    1    3    2
## [21,]    4    2    1    3
## [22,]    4    2    3    1
## [23,]    4    3    1    2
## [24,]    4    3    2    1

With correction: Order 1 1, 2, 3, 4

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 75321
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8588
## Number of communities: 6
## Elapsed time: 0 seconds

## [1] 0.230185

## [1] 0.052

With correction: Order 2 3, 2, 1, 4

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 70643
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8589
## Number of communities: 8
## Elapsed time: 0 seconds

## [1] 0.175056

## [1] 0.008

With correction: Order 3 4, 3, 1, 2

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 67654
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8541
## Number of communities: 7
## Elapsed time: 0 seconds

## [1] 0.070489

## [1] 0

## [1] 0.175056

## [1] 0.008

With correction: Order 4 1, 2, 4, 3

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 72338
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8511
## Number of communities: 6
## Elapsed time: 0 seconds

## [1] 0.23826

## [1] 0

Step IV: check similarity between different orders (with average gene expression, which is the row mean of the correct count matrix)

Step V: calculate ARI (adjusted rank index): ARI >= 0.90 excellent recovery; 0.80 =< ARI < 0.90 good recovery; 0.65 =< ARI < 0.80 moderate recovery; ARI < 0.65 poor recovery.

## [1] 0.52
## attr(,"ci")
## lower upper 
##  0.52  0.52
## [1] 0.41
## attr(,"ci")
## lower upper 
##  0.41  0.41
## [1] 0.33
## attr(,"ci")
## lower upper 
##  0.33  0.33
## [1] 0.47
## attr(,"ci")
## lower upper 
##  0.47  0.47

References:

Haghverdi, Laleh, et al. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nature biotechnology 36.5 (2018): 421.

Zhang, Xiuwei, Chenling Xu, and Nir Yosef. “Simulating multiple faceted variability in single cell RNA sequencing.” Nature communications 10.1 (2019): 2611.